Image classification method and apparatus, and method and apparatus for improving training of an image classifier

ABSTRACT

An image classification method comprises: extracting a logic program from a CNN, trained to classify features in images, which is a symbolic approximation of outputs of kernels at an extraction layer of the CNN; deriving kernel-based classification rules; forward-propagating pairs of feature-labeled images through the logic program to obtain kernel activations at the extraction layer for features in the images, where the scene in one of the pair contains a particular feature and the other is of the same scene without the feature; and calculating the correlation between each kernel in the logic program and each feature in the feature-labeled images using the kernel activations and the features associated with the feature-labeled images, assigning to each kernel in the logic program the label of the feature with which the kernel has the highest correlation, and applying the assigned kernel labels to the kernels in the rules to obtain kernel-labeled rules.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority from EP 21189010.8, filed on Aug. 2, 2021, the contents of which are incorporated by reference herein in their entirety.

Embodiments relate to an image classification method and apparatus, and a method and apparatus for improving training of an image classifier.

The field of neural-symbolic integration concerns the relationship between symbolic models, for example propositional logic programs, and neural networks. In other words, it concerns explainable artificial intelligence with respect to neural networks. This body of work includes the tasks of translating knowledge from one form of representation to the other, e.g. translating logic programs into neural networks that may be trained inductively by observation of training samples; or translating the weights of trained neural networks into logic programs so that the decisions made by neural networks may be more easily understood by humans. Rules will generally describe how the relationships between individual features (represented by individual neurons) contribute to individual class activations or to the observation of other features as evidenced by the activations of corresponding neurons.

In more recent years convolutional neural networks (CNNs) have become a popular way to perform image classification. Most efforts to explain the behaviour of CNNs have involved visualising regions of the input image that are most important or relevant to a given classification. While useful, some limitations are:

-   Such explanations are only local in that they explain individual samples and not the model as a whole (“global” explanations).
-   They do not provide much insight into the relationship between features as earlier neural-symbolic models do.
-   They do not directly provide much insight into the inner workings of the CNNs.

With respect to the last point, some methods allow one to visualise what an individual kernel responds to. A simple way to do this is to visualise (i.e. create an image corresponding to) the output of a kernel and use this to generate a mask over the original image, but more sophisticated methods will backpropagate some signal from the kernel, through the weights and activations that led to it, and back to the input image. While these allow decomposition of the model for a better understanding of its inner workings, these methods still do not provide insight into the interaction between features represented by other kernels.

Methods exist which describe CNN classification decisions in the form of graphs or trees. These do describe the relationships between different features. However, these do not allow for the expression of negated symbols (e.g. ¬A as opposed to A). A way of training kernels to be interpretable has also been proposed. However, the symbolic concept represented by a kernel using this method may belong to only one class. Also, it assumes that the CNN has been trained in a specific way. Another method for explaining CNN behaviour learns a prototype layer, which represents inputs in terms of similar components to training inputs, where each component is represented by a specific kernel in the prototype layer. However, this again assumes a specific training method and a specific type of layer. There may however be situations in which one may want to explain any CNN, not only those with specific architectures and/or that have been trained in any specific way.

In EP3291146 a method is proposed to extract logic programs from convolutional neural networks so that those logic programs may be regarded as explaining the behaviour of the corresponding CNNs. This overcomes the problems listed in the previous paragraph; i.e. it allows for negation of symbols, for symbols represented by kernels to be associated with multiple classes, and does not assume any specific training method or architecture beyond what is common for CNNs (though such training methods may still improve accuracy).

As shown in FIG. 1 of the accompanying drawings, in this method each kernel in the CNN is quantised by first mapping its output to a single value, regarded as that kernel's activation value, by applying an L1 or L2 norm to its activation map and then applying a binary threshold to that activation value.
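By way of illustration only, the quantisation step described above might be sketched as follows. The function names `kernel_activation` and `quantise`, the representation of an activation map as a list of rows, and the use of a strict "greater than" comparison against the threshold are all assumptions made for this sketch and are not taken from the prior art method itself.

```python
import math

def kernel_activation(activation_map, norm="l2"):
    """Collapse a 2-D activation map (list of rows) into a single scalar."""
    values = [v for row in activation_map for v in row]
    if norm == "l1":
        return sum(abs(v) for v in values)
    if norm == "l2":
        return math.sqrt(sum(v * v for v in values))
    raise ValueError(f"unknown norm: {norm!r}")

def quantise(activation_map, threshold, norm="l2"):
    """Binary truth value: True when the scalar activation exceeds the threshold.

    Whether the comparison is strict is an assumption of this sketch.
    """
    return kernel_activation(activation_map, norm) > threshold

# Hypothetical 2x2 activation map for one kernel.
amap = [[0.0, 3.0], [4.0, 0.0]]
l1_value = kernel_activation(amap, "l1")
l2_value = kernel_activation(amap, "l2")
```

Under these assumptions, the same activation map may quantise differently depending on the norm chosen, so the norm and the threshold would need to be fixed consistently across all kernels taking part in the program.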

FIG. 2 of the accompanying drawings shows an example CNN M and corresponding extracted logic program M′, extracted using the aforementioned method. A logic program to approximate the behaviour of the CNN M is extracted by first applying the quantisation function to all kernels to participate in the program, and then applying a decision tree extractor to each binarised kernel and its inputs to discover logical rules which describe the conditions for which each of those kernels ‘activate’. The aggregation of these rules constitutes the explanation of the overall CNN. In FIG. 2, a program is only extracted to describe the third layer (“the extraction layer”), but multiple layers could be included. Any convolutional layers preceding the extraction layer remain in M′, so that kernel activations may be obtained for quantisation into binary truths, as in FIG. 1.

However, extracted rules lack meaning without labels assigned to the kernels, which remains an open problem. The problem of labelling convolutional kernels is a CNN-specific version of the more general ‘symbol grounding problem’, that is, the question of the origin of the meaning of a symbol.

It has been proposed that labels could be assigned by visualising a kernel's output and providing this visualisation as an input to a second classifier trained on a more fine-grained dataset (henceforth referred to as a “support” dataset) in order to attribute those class labels to those kernels. This is illustrated in FIG. 3 of the accompanying drawings. The visualisation method may be the direct method, or a more sophisticated visualisation method.

Among the more sophisticated visualisation methods are perturbation-based methods, which deduce the importance of a feature based on the effect on classifier output when that feature is added, removed or modified. For example, an image may be modified by cropping out a region of the image, and if the network changes its decision with respect to the class of the input, then that region is regarded as important. Otherwise, it is not.

Recently, a method has been proposed which performs perturbation through inpainting. Inpainting is a method often used to automate the generation of training data. A model is designed and/or trained to paint a feature in, or paint a feature out of, an image and thus be used to generate a dataset for the purpose of training another model to detect the presence or absence of that feature. However, in this case, it is used not (necessarily) for training, but to determine the importance of a feature with respect to a trained network's decision process. This allows for more ‘realistic’ perturbations than simply cropping out regions of the image.

The previously proposed method for assigning labels to kernels using a support dataset is computationally expensive, as both the visualisation method and the kernel classifier must be applied once for each image of the support dataset and for each kernel for which a label is to be assigned. This is especially expensive for the more sophisticated visualisation methods, which back-propagate some signal from the kernel to be visualised back onto the input image.

Furthermore, inpainting is an expensive process if it is not needed. For example, it is only used to generate training data if an adequate training set has not been acquired. The same is true of using it as a means of perturbation-based feature importance calculation; it would be better to use real photographic datasets that represent the presence or absence of features in otherwise unchanging scenes, if such data were available.

It is therefore desirable to be able to assign meaningful labels to kernels in extracted rules more efficiently.

According to an embodiment of a first aspect there is provided a computer-implemented image classification method comprising: obtaining a convolutional neural network, CNN, trained to classify features in images using a training image dataset; extracting a logic program from the CNN, the logic program being a symbolic approximation of outputs of kernels at an extraction layer of the CNN, and deriving from the logic program rules which use the kernels to explain the classification of images by the CNN; obtaining a feature-labeled image dataset, and a record of each feature associated with each feature-labeled image in the dataset, where the images in the dataset comprise pairs of feature-labeled images, one feature-labeled image of the pair being of a scene containing a feature and the other feature-labeled image of the pair being of the same scene without the feature; forward-propagating the pairs of feature-labeled images through the logic program to obtain kernel activations at the extraction layer for features in the images; and calculating a correlation between each kernel in the logic program and each feature in the feature-labeled images using the obtained kernel activations and the features associated with the feature-labeled images, assigning to each kernel in the logic program the label of the feature with which the kernel has the highest correlation, and applying the assigned kernel labels to the kernels in the derived rules to obtain kernel-labeled rules.

Embodiments provide a new approach to assigning symbolic labels to kernels in convolutional neural networks (CNNs), so that those labeled kernels may be manipulated by a logic program. In contrast to the previous approach which proposed classifying the output of individual kernels for each image from a labeled ‘support dataset’ set aside for this purpose, performance is improved by only requiring the kernel classifier to be applied once per kernel after all support images have been processed.

In particular, in both the above-mentioned prior art method and the present embodiments, kernel labelling may involve forward propagating a labeled training image from the support dataset and quantising kernel outputs. However, in the previously-proposed method a classification must be applied to each kernel and for each support image, whereas in a method according to an embodiment it is sufficient to annotate a table to identify which kernels were activated in that image. After all support images have been processed, classification only needs to be performed once per kernel by selecting as the label a tag assigned to the image (for example, during manual labelling) that correlates most strongly with that kernel's activation. This is based on the assumption that a kernel whose activation changes drastically between two images which differ only in the presence of a tagged feature may be argued to correspond to that tag/feature. To realise this the network is presented with at least two versions of an image from the support dataset during the labelling process, one with and one without a given tagged feature but otherwise identical.

Thus, the complexity of the previously proposed approach to labelling kernels is reduced, as it is no longer necessary to apply a classifier once per kernel per image, which in turn reduces demand on computational resource.

According to an embodiment of a second aspect there is provided a computer-implemented method of improving training of an image classifier, the method comprising: for a convolutional neural network, CNN, trained to classify features in images, obtaining kernel-labeled rules which have been derived from the CNN using the method embodying the first aspect; for at least one image not forming part of the training image dataset used to train the CNN or the feature-labeled image dataset used to derive the kernel-labeled rules, obtaining a classification of the at least one image determined by the CNN, which classification has been assessed as being incorrect, and identifying a rule of the kernel-labeled rules which is associated with the incorrect classification; and causing the CNN to be retrained using further training images containing features corresponding to the kernel labels of the rule associated with the incorrect classification.

According to an embodiment of a third aspect there is provided a computer program which, when run on a computer, causes that computer to carry out a method embodying the first and/or second aspect.

According to an embodiment of a fourth aspect there is provided image classification apparatus comprising: at least one memory to store: (a) a convolutional neural network, CNN, trained to classify features in images using a training image dataset, and (b) a feature-labeled image dataset and a record of each feature associated with each feature-labeled image in the dataset, where the images in the dataset comprise pairs of feature-labeled images, one feature-labeled image of the pair being of a scene containing a feature and the other feature-labeled image of the pair being of the same scene without the feature; and at least one processor, connected to the memory, to: extract a logic program from the CNN stored in the memory, the logic program being a symbolic approximation of outputs of kernels at an extraction layer of the CNN, and derive from the logic program rules which use the kernels to explain the classification of images by the CNN; forward-propagate the pairs of feature-labeled images from the feature-labeled dataset stored in the memory through the logic program to obtain kernel activations at the extraction layer for features in the images; and calculate a correlation between each kernel in the logic program and each feature in the feature-labeled images using the obtained kernel activations and the features associated with the feature-labeled images, assign to each kernel in the logic program the label of the feature with which the kernel has the highest correlation, and apply the assigned kernel labels to the kernels in the derived rules to obtain kernel-labeled rules.

According to an embodiment of a fifth aspect there is provided apparatus to improve training of an image classifier, the apparatus comprising: at least one memory storing: for a convolutional neural network, CNN, trained to classify features in images, kernel-labeled rules which have been derived from the CNN using apparatus embodying the fourth aspect or a method embodying the first aspect; and for at least one image not forming part of the training image dataset used to train the CNN or the feature-labeled image dataset used to derive the kernel-labeled rules, a classification of the at least one image determined by the CNN, which classification has been assessed as being incorrect; and at least one processor, connected to the memory, to: identify a rule of the kernel-labeled rules which is associated with the incorrect classification; and cause the CNN to be retrained using further training images containing features corresponding to the kernel labels of the rule associated with the incorrect classification.

In a method embodying the first aspect or apparatus embodying the fourth aspect, images in the feature-labeled image dataset may comprise still frames from at least one video recording. The at least one video recording may have been captured by a closed circuit television (CCTV) camera.

The manual effort required to label the support dataset may be reduced by exploiting the fact that frames from still video/CCTV cameras capture the differences between the presence, absence or change of entities in the camera view when the background is stationary, and the fact that kernel activations will differ accordingly. The use of videos from still cameras is a much more efficient way of generating scene perturbations, because the perturbations are natural and do not require objects to be manually placed. For example, to identify kernels which relate to cars it would be laborious to have to drive cars in and out of camera shot to obtain ‘with’ and ‘without’ images. However, CCTV in a car park observes cars arriving and leaving all the time.

By taking training frames from static video cameras as the support dataset, three shortcomings may be overcome:

-   Labelling frames from such videos is a less laborious task than labelling a number of images equal to the total number of frames across all the videos, as one need only label instances of features appearing or disappearing, i.e. frames only require annotating when objects enter or exit a scene, as opposed to manually tagging individual frames.
-   The lack of realism of naïve perturbations, such as cropping out or blurring tagged objects, is avoided.
-   The computational load required by automatic inpainting methods for adding or removing objects is avoided: no extra processing is needed to artificially add or remove tagged objects, because objects in videos enter and exit scenes of their own accord.

That said, a user may still use still images and/or artificially perturbed images or frames in embodiments if they so wish.

The use of images from still cameras to improve performance may make embodiments especially relevant to anyone who works with such hardware. For example, embodiments may be applied to obtain explainable classifications of CCTV footage taken at airports, on transport networks and the like.

In a method embodying the first aspect or apparatus embodying the fourth aspect, the feature-labeled image dataset may comprise images annotated for semantic segmentation. The record of each feature associated with each feature-labeled image in the dataset may comprise a value corresponding to a total area occupied by the feature in the image.

Reference will now be made, by way of example, to the accompanying drawings, in which:

FIG. 1 (described above) is a diagram for use in explaining quantisation of each kernel in a CNN according to a prior art method;

FIG. 2 (described above) is a diagram for use in explaining extraction of a logic program from the CNN according to the prior art method;

FIG. 3 (described above) is a diagram for use in explaining a prior art method of assigning labels to kernels of a CNN;

FIG. 4 is a flowchart of a method according to an embodiment;

FIG. 5 is a flowchart of a pre-processing method for use with a method according to an embodiment;

FIG. 6 is a diagram for use in explaining the pre-processing method of FIG. 5;

FIG. 7 is a flowchart of a process used in a method according to an embodiment;

FIG. 8 is a diagram representing a series of still CCTV frames and associated kernel activations;

FIG. 9 is a table illustrating a completed presence matrix corresponding to the frames of FIG. 8;

FIG. 10 is a table illustrating a completed kernel activation matrix corresponding to the frames of FIG. 8;

FIG. 11 is a diagram representing the series of still CCTV frames and associated kernel activations of FIG. 8, with segmented regions;

FIG. 12 is a table illustrating a completed presence matrix and a completed kernel activation matrix corresponding to the frames of FIG. 11;

FIG. 13 is a table of feature-kernel correlation values corresponding to the frames of FIGS. 8 and 11;

FIG. 14 is a table of extracted rules interpreted according to the process of FIG. 7;

FIG. 15 is a diagram for use in explaining a misclassified image;

FIG. 16 is a flowchart of a method according to an embodiment; and

FIG. 17 is a block diagram of a computing device suitable for carrying out a method according to an embodiment.

In an embodiment described below, initial training of a CNN is carried out and a representative logic program is extracted, as in the prior art. To label the atoms in the logic program, their corresponding kernels are isolated, and changes in the activations of those kernels as images from a support dataset are classified sequentially by the CNN are observed. For example, a binary flip of a kernel's quantised activation as an object enters a scene, and again when it exits the scene, is regarded as evidence that the kernel is activated in response to presence of that object.

A high-level overview of the overall method is given in FIG. 4, which is described below.

1. Datasets and Pre-Processing

In Step 1 a problem dataset for training the CNN is obtained.

Obtaining Problem Dataset

The problem dataset represents the original classification task that the CNN to be explained is trained to perform. Thus, the problem dataset is a set of images plus their corresponding class labels, all partitioned into training, validation and test partitions. These images may be video frames.

In this example, at Step 1 a support dataset may also be obtained. Alternatively, this may be obtained at any time before Step 3.

Obtaining Support Dataset

FIG. 5 illustrates a pre-processing pipeline for the support dataset.

The support dataset will be used for the purpose of labelling kernels. It may be the same as the problem dataset, if the problem dataset meets the criteria for a support dataset as described in the following paragraph.

For each instance of a feature, the support dataset must have at least one image of the scene with that feature and one without it, with all other features in the image otherwise remaining the same. Thus, if the number of object instances is N, then the support dataset should ideally have at minimum 2*N images, plus a “presence matrix” which identifies which images correspond to the presence or absence of the feature. Ways of obtaining such data include (but are not limited to):

-   Taking frames from a still video camera in which objects enter/exit the scene (assumed method henceforth).
-   Taking a photograph of an object in a scene, removing the object and then taking another photograph.
-   For still images, using in-painting to add or remove objects.

There are multiple options for completing the presence matrix for the support dataset, if a completed presence matrix has not already been provided. These include but are not limited to:

-   As shown in Table 1 (FIG. 9) with respect to the example in FIG. 8, marking the appearance or disappearance of features of interest with 1 or −1 respectively on the frames in which these corresponding transitions occur (Step 1.1 of FIG. 5). The intervals between an appearance and the subsequent disappearance are then automatically filled with 1's in the presence matrix (Step 1.2 of FIG. 5). This is the assumed method used for examples described later in this document.
-   If the dataset has already been annotated for semantic segmentation (whether video or still images), adding a 1 to the presence matrix for each object (i.e. feature) present in each frame. Alternatively, a value corresponding to the total area taken up by each object (feature) may be input (Table 3 (FIG. 12), with reference to FIG. 9).
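The automatic filling of the presence matrix from raw 1/−1 annotations might, for example, be sketched as below. The function name `fill_presence` and the row-per-frame, column-per-feature layout are assumptions of this sketch; it also assumes that a feature counts as present from the frame marked 1 up to, but not including, the frame marked −1.

```python
def fill_presence(raw):
    """Expand raw transition marks into a presence matrix.

    raw[t][f] is 1 if feature f appears at frame t, -1 if it disappears
    at frame t, and 0 if the annotator marked nothing for that frame.
    """
    n_features = len(raw[0]) if raw else 0
    present = [0] * n_features        # running state of each feature
    matrix = []
    for row in raw:
        for f, mark in enumerate(row):
            if mark == 1:
                present[f] = 1
            elif mark == -1:
                present[f] = 0
        matrix.append(list(present))  # copy the state for this frame
    return matrix

# Hypothetical five-frame clip with two features.
raw_marks = [[1, 0], [0, 1], [0, 0], [-1, 0], [0, -1]]
presence = fill_presence(raw_marks)
```

With this convention the annotator only touches frames on which something changes; every other cell is derived from the running state.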

Frames and corresponding presence matrices may be generated from multiple videos, in which case they are combined into a single dataset of n images/rows corresponding to one set of images and one presence matrix, as shown in FIG. 6 (Step 1.3 of FIG. 5).
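One possible way of combining per-video data into a single dataset is sketched below. The helper `combine_videos` and its tuple layout are hypothetical, and the sketch additionally assumes that a feature which was never annotated in a given video is absent (0) from every frame of that video.

```python
def combine_videos(videos):
    """Merge per-video frames and presence matrices into one dataset.

    videos: list of (frames, feature_names, presence_rows) tuples.
    Returns (all_frames, all_features, combined_presence).
    """
    all_features = []
    for _, names, _ in videos:
        for name in names:
            if name not in all_features:
                all_features.append(name)
    frames, presence = [], []
    for vid_frames, names, rows in videos:
        col = {name: i for i, name in enumerate(names)}
        for frame, row in zip(vid_frames, rows):
            frames.append(frame)
            # Assumption: features not annotated in this video are absent.
            presence.append([row[col[n]] if n in col else 0
                             for n in all_features])
    return frames, all_features, presence

# Two hypothetical videos with partly different feature sets.
video_a = (["a0", "a1"], ["Person", "Van"], [[1, 0], [1, 1]])
video_b = (["b0"], ["Tree", "Van"], [[1, 1]])
frames, features, combined = combine_videos([video_a, video_b])
```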

2. Initial Training CNN

At Step 2, the CNN M is trained on the problem dataset in the usual way.

3. Knowledge Extraction

At Step 3, a logic program M′, which is a quantised/symbolic approximation of M, is extracted from M, for example according to one of the above-described extraction methods (e.g. as shown in FIG. 2) or any method which produces quantised approximations of kernel outputs. Rules for explaining the CNN's classifications are derived from the logic program. However, the derived rules do not have labels assigned to the kernels.

4. Kernel Labelling

At Step 4 kernels in the symbolic approximation M′ are labeled. An overview of the kernel labelling process carried out at Step 4 is shown in FIG. 7.

At Step 4.1 of FIG. 7, each support image is forward-propagated through M′ to obtain the quantised kernel activations at the extraction layer, and those quantisations are recorded as in Table 2 (FIG. 10), in line with the presence matrix values generated earlier (Table 1 (FIG. 9)).
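Purely as a sketch of Step 4.1, the kernel activation matrix could be recorded as follows. The callable `forward` stands in for forward-propagation through M′ and is hypothetical, as are the function name `activation_matrix` and the use of a single global threshold with a strict comparison.

```python
def activation_matrix(images, forward, kernels, threshold):
    """Record one row of quantised kernel activations per support image.

    forward(image) is assumed to return a dict mapping each kernel name
    to its scalar activation (e.g. an L1 or L2 norm of its output map).
    """
    matrix = []
    for image in images:
        scalars = forward(image)
        matrix.append([1 if scalars[k] > threshold else 0 for k in kernels])
    return matrix

# Stand-in forward pass: each "image" is already a dict of scalars.
def fake_forward(image):
    return image

support = [{"A": 0.2, "G": 1.4}, {"A": 1.1, "G": 0.3}]
table = activation_matrix(support, fake_forward, ["A", "G"], threshold=1.0)
```

Each support image is processed once, and no per-kernel classifier is invoked at this stage.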

Alternatively, if segmentation annotations are available as mentioned above, the presence matrix may be filled with the total area taken up by segments of each class, and the kernel activation matrix filled with the activation values before the thresholding part of quantisation (i.e. after L1 or L2 norms have been obtained, but before thresholding). Values for the current example are shown in Table 3 (FIG. 12), based on segments shown in FIG. 11.

Note that some kernels in the original CNN M may not have corresponding literals in the symbolic approximation M′. Therefore, there is no need to generate labels for these literals/kernels.

Then, at Step 4.2, the correlation between each observed feature and each kernel is calculated, for example according to the Phi Coefficient, Spearman's rank correlation coefficient, the Kendall rank correlation coefficient or some other known method of calculating the correlation between two binary variables. Features and kernels which show no change with respect to presence or activation may be excluded from this process.
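As one example of Step 4.2, the Phi Coefficient between a binary kernel activation column and a binary feature presence column may be computed from the 2×2 contingency counts as (n11·n00 − n10·n01) / sqrt(n1•·n0•·n•1·n•0). The sketch below uses a hypothetical function name `phi` and returns `None` when either column never changes, which is one way of realising the exclusion mentioned above.

```python
import math

def phi(x, y):
    """Phi coefficient between two equal-length sequences of 0/1 values.

    Returns None if either variable is constant (denominator is zero).
    """
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return None
    return (n11 * n00 - n10 * n01) / denom

same = phi([1, 1, 0, 0], [1, 1, 0, 0])       # perfectly aligned columns
opposite = phi([1, 1, 0, 0], [0, 0, 1, 1])   # perfectly opposed columns
constant = phi([1, 1, 1, 1], [1, 0, 1, 0])   # first column never changes
```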

At Step 4.3 each kernel is assigned the label of the feature for which it yields the highest correlation. The symbolic approximation M′ plus the newly assigned labels is now referred to as M″.
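Step 4.3 amounts to taking, for each kernel, the feature with the largest correlation value. A minimal sketch is given below, in which the function name `assign_labels`, the nested-dictionary layout and the correlation values are all illustrative; it assumes that the signed (rather than absolute) correlation is compared and that kernels with no defined correlation are left unlabeled.

```python
def assign_labels(correlations):
    """Pick, for each kernel, the feature with the highest correlation.

    correlations[kernel][feature] is a float, or None where the
    correlation is undefined (e.g. a constant feature or kernel).
    """
    labels = {}
    for kernel, row in correlations.items():
        scored = {f: c for f, c in row.items() if c is not None}
        if scored:
            labels[kernel] = max(scored, key=scored.get)
    return labels

# Hypothetical correlation table (not the values of FIG. 13).
corr = {
    "A": {"Van": 0.9, "Person": 0.1, "Door": None},
    "G": {"Van": -0.2, "Person": 0.8, "Door": None},
    "C": {"Van": None, "Person": None, "Door": None},
}
kernel_labels = assign_labels(corr)
```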

In the case of segmented image datasets, a label may be assigned to a kernel based on the correlation between the kernel activation strength and the area of a segmented region pertaining to a class with the same label, provided the correlation metric used in this case may be applied to continuous variables (e.g. Pearson or Spearman's).
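For the segmentation-based variant, a Pearson correlation between segment area and unthresholded activation could be computed as in the following sketch. The function name `pearson` and the area and activation values are hypothetical and are not taken from Table 3.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns None if either sequence has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-frame area (pixels) of one class and activation of one kernel.
areas = [0, 10, 40, 20, 0]
activations = [0.1, 0.6, 2.1, 1.1, 0.1]
r = pearson(areas, activations)
```

In this made-up example the activation grows linearly with the area, so the coefficient is (up to rounding) 1; real data would generally give weaker values.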

The rules of M′ may therefore now be translated into terms which use the assigned kernel labels.
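The translation of rules into labeled terms might be sketched as a simple substitution. The rule representation (a list of kernel/negation pairs plus a head), the function name `label_rule` and the example labels are assumptions made for illustration only.

```python
def label_rule(body, head, labels):
    """Render a rule with kernel letters replaced by their assigned labels.

    body: list of (kernel, negated) pairs; head: class name.
    Kernels without an assigned label keep their original letter.
    """
    parts = []
    for kernel, negated in body:
        name = labels.get(kernel, kernel)
        parts.append(("¬" if negated else "") + name)
    return " ∧ ".join(parts) + " → " + head

# Illustrative labels only.
example_labels = {"G": "Person", "E": "Tree"}
plain = label_rule([("G", False), ("E", False)], "Street", {})
labeled = label_rule([("G", False), ("E", False)], "Street", example_labels)
negated = label_rule([("A", True), ("G", False)], "Street", example_labels)
```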

5. Inference

At Step 5, inference is carried out. If the symbolic approximation M′ is to be used for logic inference, classifications made by M may be explained by executing the symbolic approximation M′ in parallel (as in the prior art). However, we now use M″, not M′, so as mentioned above the atoms used in the rules/explanations use labels assigned using the proposed kernel labelling process.

If it is noted during inference, using live or test images, that an image has been classified incorrectly by the trained CNN, then the CNN is retrained using further training images. In this case the rule associated with the incorrect classification is used to determine what features are to be shown in the further training images. In particular, since the rule indicates the features which led the associated kernels to activate, resulting in the misclassification of the image, retraining the CNN using more images showing the features concerned will help the retrained CNN to avoid such misclassification in future.

In particular, as shown in the process of FIG. 16, training of an image classifier may be improved by, for a CNN trained to classify features in images, in step S161 obtaining kernel-labeled rules which have been derived from the CNN using a method according to an embodiment as described above, in step S162 obtaining, for at least one image not forming part of the training image dataset used to train the CNN or the feature-labeled image dataset used to derive the kernel-labeled rules, an incorrect classification of the at least one image determined by the CNN, and identifying a rule of the kernel-labeled rules which is associated with the incorrect classification, and in step S163 causing the CNN to be retrained using further training images containing features corresponding to the kernel labels of the rule associated with the incorrect classification.
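One conceivable way of identifying the rule associated with an incorrect classification, and of deriving from it the features to be shown in further training images, is sketched below. The rule representation, the helper names `fired_rules` and `features_for_retraining`, and the choice to collect only the labels of non-negated kernels are assumptions of this sketch rather than requirements of the method of FIG. 16.

```python
def fired_rules(rules, truths, predicted):
    """Rules whose head is the predicted class and whose body is satisfied.

    rules: list of (body, head) with body a list of (kernel, negated) pairs.
    truths: dict mapping kernel name to its quantised activation (bool).
    """
    matches = []
    for body, head in rules:
        if head != predicted:
            continue
        if all(truths.get(k, False) != negated for k, negated in body):
            matches.append((body, head))
    return matches

def features_for_retraining(rules, truths, predicted, labels):
    """Labels of the active kernels in the rules behind a misclassification."""
    features = []
    for body, _ in fired_rules(rules, truths, predicted):
        for kernel, negated in body:
            label = labels.get(kernel)
            if not negated and label and label not in features:
                features.append(label)
    return features

# Hypothetical rules, labels and kernel truth values for one misclassified image.
rules = [
    ([("G", False), ("E", False)], "Street"),
    ([("A", False), ("G", True)], "Motorway"),
]
labels = {"G": "Person", "E": "Tree", "A": "Van"}
truths = {"A": False, "E": True, "G": True}
wanted = features_for_retraining(rules, truths, "Street", labels)
```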

Embodiments may be applied in any scenario where classifications are to be made using video data. One example would be CCTV security cameras for detecting security risks. These may need to be debugged if they yield false positives that lead to innocent parties being wrongly accused of a crime. Another application might be automated video tagging, i.e. to explain and diagnose incorrect tags.

A further application is to an Advanced Driver-Assistance System (ADAS). An ADAS is trained to recognise road scenes so that it may estimate local driving regulations if no traffic signs are visible and GPS connection has been lost (which would normally be used to retrieve local information). For example, if the ADAS recognises the local scene as a residential street in the UK, the safest assumption is a 30 mph speed limit. If it sees a motorway (highway), a 70 mph limit may be assumed. In the case of a school nearby, there is probably a 20 mph limit.

In the event that a scene is misclassified, there is a risk that a car could drive at an unsafe speed. For example, 70 mph in a residential street or 30 mph on a motorway are both hazardous. Thus, whether such instances are observed during development and testing, or by a user using the deployed system, these errors must be understood and corrected by the manufacturer.

Application to an ADAS of a method according to an embodiment will now be described.

Worked Example

CNN Training and Knowledge Extraction

The CNN is trained on a scene classification dataset (the problem dataset) and rules for explaining the CNN's classifications are extracted using the prior art method described with reference to FIG. 2 or similar. However, the extracted rules do not have labels assigned to the kernels. By default, the kernels are labeled with alphabetical letters as in FIG. 2. For example, a rule which identifies streets appears as G ∧ E → Street. The example presented in FIG. 2 will be used for the remainder of this scenario.

Kernel Labelling

A dataset of videos from fixed CCTV cameras is selected as the support dataset. The system iterates through each video, using differences between frames to support the labelling of kernels as described in the example below.

Note that since the extracted logic program M′ does not include literals for D, H, I or L in this example, the corresponding kernels are excluded from the process and so we do not need to label them.

FIG. 8 illustrates 8 frames of a video taken by a CCTV camera, fixed on the side of a building by a road, and corresponding changes in kernel activations. Table 1 (FIG. 9) shows a presence matrix generated based on raw annotations of when objects enter and leave the scene. The following narrative explains annotation and kernel activation in parallel, though in practice it is assumed that the annotation matrix (Table 1, left) would have been completed before executing the extraction process.

-   t=0: At the beginning of the video, a door, a tree and some windows are already in view, so the annotator will have marked ‘1’ under these headings in the ‘raw annotation’ matrix. Three kernels, ‘C’, ‘E’ and ‘J’, are already active according to their magnitudes with respect to a global threshold, implying that they are related to visible objects.
-   t=1: A person emerges from the door, and so the annotator will have marked a ‘1’ for ‘Person’. Although the door, tree and windows are still in view, there was no need to mark ‘1’ for these again, as the system assumes they are still present unless otherwise informed. This is reflected in the presence matrix (Table 1, right), generated automatically from the annotation matrix. One more kernel, ‘G’, has become active, suggesting a relationship to the person who entered the scene.
-   t=2: The person has moved closer to the right of the camera view. Meanwhile, no further objects have entered or left the view and any changes in kernel activations are negligible.
-   t=3: The person moves closer still to the edge but remains in view. A vehicle has entered the scene from the left, and the annotator has marked ‘1’ under ‘Van’ to signify that this is a van entering the scene. Kernel ‘A’ has become active, but only by a narrow margin.
-   t=4: The person begins to disappear and the activation of kernel ‘G’ gets weaker (though it is nonetheless still active), further supporting the evidence that this kernel corresponds to people. More of the van appears in view as kernel A's activation gets stronger, suggesting correlation here also. Furthermore, the van occludes the tree and kernel E becomes inactive, suggesting that E responds to trees. The annotator marked ‘−1’ for the ‘Tree’ label to indicate that it has disappeared, and the ‘Tree’ column of the presence matrix from t=0 to t=3 was automatically populated with 1's.
-   t=5: The person has now left the scene and kernel G is inactive again. The annotator marked ‘−1’ under ‘Person’ to state that they have disappeared, and the ‘Person’ column of the presence matrix is populated with 1's from t=1 (when the person appeared) to t=4.
-   t=6: The van is still in view but has passed the tree, which is no longer occluded. The annotator marked ‘1’ under ‘Tree’ to indicate that it reappears in this frame, and kernel E is active again. Kernel A's activation has weakened as the van begins to exit the scene.
-   t=7: The van has now exited the scene, and so the annotator has marked ‘−1’ under ‘Van’ to indicate this. Kernel A is inactive again. The ‘Van’ column of the presence matrix is set to 1 for t=3 to t=6 (the frames from the van's appearance up to the frame before its disappearance). Also, as this is the end of the video, the presence matrix columns of all entities still visible were populated with 1's: from t=0 to t=7 for the ‘Door’ and ‘Window’ (since these never disappeared) and from t=6 to t=7 for the ‘Tree’ (starting from when the tree reappeared). Finally, note that the scene and kernel activations have all returned to their original states as seen for t=0.

This process is repeated for further training videos, with the annotator needing to mark only when entities in the camera view appear or disappear. The presence or absence of these entities in all other frames is filled into the presence matrix automatically by the system.
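The expansion of sparse appear/disappear marks into a dense presence matrix can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `presence_from_annotations` and the dictionary-per-frame input format are assumptions, and the sketch fills the matrix frame by frame instead of back-filling a column when a disappearance is marked, which gives the same result.

```python
def presence_from_annotations(raw):
    """Expand sparse appear/disappear events into a dense presence matrix.

    raw: one dict per frame mapping a label to +1 (appears in this frame)
         or -1 (has disappeared by this frame). Unmarked labels are omitted.
    Returns (labels, presence), where presence[t][i] is 1 if labels[i] is
    present in frame t and 0 otherwise.
    """
    labels = sorted({label for row in raw for label in row})
    state = {label: 0 for label in labels}  # nothing is present before t=0
    presence = []
    for row in raw:
        for label, event in row.items():
            state[label] = 1 if event == 1 else 0
        # Labels without a mark keep their state from the previous frame.
        presence.append([state[label] for label in labels])
    return labels, presence


# Raw annotations following the narrative above (t=0 .. t=7).
raw_annotations = [
    {"Door": 1, "Tree": 1, "Window": 1},  # t=0
    {"Person": 1},                        # t=1
    {},                                   # t=2
    {"Van": 1},                           # t=3
    {"Tree": -1},                         # t=4
    {"Person": -1},                       # t=5
    {"Tree": 1},                          # t=6
    {"Van": -1},                          # t=7
]

labels, presence = presence_from_annotations(raw_annotations)
for t, row in enumerate(presence):
    print(t, dict(zip(labels, row)))
```

With this input the ‘Person’ column is 1 for t=1 to t=4 and the ‘Van’ column is 1 for t=3 to t=6, matching the narrative.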

Kernel Labelling (Segmentation-Based Alternative)

Alternatively, if segmentation annotations are available, the presence matrix could be filled with the total area (in pixels) taken up by segments of each class, and the kernel activation matrix filled with the activation values before quantisation. Values for the current example are shown in Table 3 (FIG. 12), based on the segments shown in FIG. 11. In FIG. 11, each number in a segmented region corresponds to a different colour and therefore label. In Table 3 the presence and activation matrices are completed according to the frame segmentations shown in FIG. 11. That is, each value of the presence matrix corresponds to the area occupied by the corresponding feature in the corresponding frame. For example, at t=4, the van (7) occupies 40 pixels.
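As an illustration of the area-based variant, one row of such a presence matrix can be computed by counting the pixels of each class in a segmentation mask. The sketch below is hypothetical: the function name `area_row`, the toy mask and the class identifiers other than the van (7) are assumptions chosen for the example and are not the values of Table 3.

```python
from collections import Counter


def area_row(segmentation, class_ids):
    """Return the pixel area of each class in one segmented frame.

    segmentation: 2D list of integer class identifiers, one per pixel.
    class_ids:    class identifiers in presence-matrix column order.
    """
    counts = Counter(pixel for row in segmentation for pixel in row)
    return [counts.get(class_id, 0) for class_id in class_ids]


# Toy 3x5 mask: 0 = background, 3 = hypothetical 'tree' id, 7 = van.
frame_mask = [
    [0, 0, 3, 3, 0],
    [7, 7, 3, 3, 0],
    [7, 7, 0, 0, 0],
]

# Column order of the presence matrix; id 5 is a class absent from the frame.
row = area_row(frame_mask, class_ids=[3, 5, 7])
print(row)
```

A class that does not occur in the frame receives an area of zero, which plays the role of the ‘0’ entries in the binary presence matrix.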

Correlation Matrix

After all training videos have been observed, the Pearson correlations between the processed annotations and the kernel activations are calculated, as shown in Table 4 (FIG. 13). The maximum absolute value for each kernel is shown in bold, and each kernel is assigned the label of the feature with which its absolute correlation is strongest according to this matrix. The rules of M′ may now be interpreted as shown in Table 5 (FIG. 14).
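The correlation and label-assignment step can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names `pearson` and `assign_labels` are hypothetical, the activation values are idealised binary stand-ins rather than the values of Table 4, and a constant column is treated as having zero correlation because the Pearson coefficient is undefined in that case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column carries no evidence
    return cov / math.sqrt(var_x * var_y)


def assign_labels(presence, activations):
    """Give each kernel the label with the strongest absolute correlation.

    presence:    label -> per-frame presence (or area) values
    activations: kernel -> per-frame activation values
    """
    return {
        kernel: max(presence, key=lambda lab: abs(pearson(presence[lab], acts)))
        for kernel, acts in activations.items()
    }


# Presence columns from the CCTV narrative (t=0 .. t=7).
presence = {
    "Person": [0, 1, 1, 1, 1, 0, 0, 0],
    "Tree":   [1, 1, 1, 1, 0, 0, 1, 1],
    "Van":    [0, 0, 0, 1, 1, 1, 1, 0],
}
# Idealised, hypothetical binary activations for three of the kernels.
activations = {
    "G": [0, 1, 1, 1, 1, 0, 0, 0],
    "E": [1, 1, 1, 1, 0, 0, 1, 1],
    "A": [0, 0, 0, 1, 1, 1, 1, 0],
}

kernel_labels = assign_labels(presence, activations)
print(kernel_labels)
```

Taking the absolute value means that a kernel which is reliably inactive whenever a feature is present is also labelled with that feature, consistent with the use of maximum absolute values in the correlation matrix.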

Inference

Later, after the trained and labeled ADAS software has been deployed in a car, a user is driving along a motorway (highway) only to discover that the car is advising him to slow down because it has determined that he is in a residential street (see FIG. 15). In view of the camera is somebody who has broken down next to a tree and is waiting outside their car for roadside assistance. The driver sees that the car has classified the scene as ‘Street’, and that the explanation associated with this classification is Person ∧ Tree → Street, i.e. the classification is due to the presence of a person and a tree.

The user reports this error to the manufacturer, who upon inspecting the explanation accepts that the rule is an unreasonable assumption to make, as it is highly likely that trees may be found by the roadside on the motorway and, although less likely, possible that people may be found standing by the motorway in scenarios such as this.

With this explanation, the developer knows that their model must be retrained with more examples of motorways in which humans are waiting by their cars, and/or trees may be found by the roadside.
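The way a kernel-labeled rule yields both the explanation and the list of features to collect for retraining can be sketched as follows. The sketch is hypothetical: the rule representation (a list of kernel identifiers and a class), the helper names, the choice of kernels ‘G’ and ‘E’ for the ‘Street’ rule and the second rule are assumptions for illustration and are not taken from Table 5.

```python
def label_rule(rule, kernel_labels):
    """Replace kernel identifiers in a rule body with their assigned labels."""
    body, head = rule
    return [kernel_labels.get(kernel, kernel) for kernel in body], head


def format_rule(labeled_rule):
    """Render a labeled rule as a conjunction implying a class."""
    body, head = labeled_rule
    return " ∧ ".join(body) + " → " + head


def retraining_features(labeled_rules, wrong_class):
    """Collect the labels used by rules that conclude the wrong class."""
    features = set()
    for body, head in labeled_rules:
        if head == wrong_class:
            features.update(body)
    return sorted(features)


# Hypothetical kernel labels and extracted rules over kernel identifiers.
kernel_labels = {"G": "Person", "E": "Tree", "A": "Van"}
rules = [(["G", "E"], "Street"), (["A"], "Motorway")]

labeled_rules = [label_rule(rule, kernel_labels) for rule in rules]
explanation = format_rule(labeled_rules[0])
to_collect = retraining_features(labeled_rules, "Street")
print(explanation)
print(to_collect)
```

In the scenario above, the features returned for the incorrect ‘Street’ classification would guide the collection of further motorway images that contain people and trees.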

FIG. 17 is a block diagram of a computing device, such as a data storage server, which embodies the present invention, and which may be used to implement some or all of the operations of a method embodying the present invention, and perform some or all of the tasks of apparatus of an embodiment. For example, the computing device of FIG. 17 may be used to implement some or all of the processes described with reference to FIG. 4, 5, 7 and/or 16.

The computing device comprises a processor 993 and memory 994. Optionally, the computing device also includes a network interface 997 for communication with other such computing devices, for example with other computing devices of invention embodiments.

For example, an embodiment may be composed of a network of such computing devices. Optionally, the computing device also includes one or more input mechanisms such as keyboard and mouse 996, and a display unit such as one or more monitors 995. The components are connectable to one another via a bus 992.

The memory 994 may include a computer readable medium, which term may refer to a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) configured to store information, such as the problem dataset, the support image dataset, kernel-labeled rules, misclassified images, and/or images used for retraining, and/or carry computer-executable instructions. Computer-executable instructions may include, for example, instructions and data accessible by and causing a general purpose computer, special purpose computer, or special purpose processing device (e.g., one or more processors) to perform one or more functions or operations. For example, the computer-executable instructions may include those instructions for implementing some or all of the steps shown in FIG. 4, FIG. 5, FIG. 7, or FIG. 16, or for implementing one or more of the processes described with reference to FIG. 4 or FIG. 5 or FIG. 6 or FIG. 7 or FIG. 16. Thus, the term “computer-readable storage medium” may also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methods of the present disclosure. The term “computer-readable storage medium” may accordingly be taken to include, but not be limited to, solid-state memories, optical media and magnetic media. By way of example, and not limitation, such computer-readable media may include non-transitory computer-readable storage media, including Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory devices (e.g., solid state memory devices).

The processor 993 is configured to control the computing device and execute processing operations, for example executing computer program code stored in the memory 994 to implement the methods described with reference to FIG. 4, FIG. 7 and/or FIG. 16 and defined in the claims. The memory 994 stores data being read and written by the processor 993. As referred to herein, a processor may include one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. The processor may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processor may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In one or more embodiments, a processor is configured to execute instructions for performing the operations discussed herein.

The display unit 995 may display a representation of data stored by the computing device, such as images from the problem dataset, the support image dataset, misclassified images, and/or images used for retraining, and may also display a cursor and dialog boxes and screens enabling interaction between a user and the programs and data stored on the computing device. The input mechanisms 996 may enable a user to input data and instructions to the computing device.

The network interface (network I/F) 997 may be connected to a network, such as the Internet, and is connectable to other such computing devices via the network. The network I/F 997 may control data input/output from/to other apparatus via the network. Other peripheral devices such as microphone, speakers, printer, power supply unit, fan, case, scanner, trackerball, etc. may be included in the computing device.

Methods embodying the present invention may be carried out on a computing device such as that illustrated in FIG. 17. Such a computing device need not have every component illustrated in FIG. 17, and may be composed of a subset of those components. A method embodying the present invention may be carried out by a single computing device in communication with one or more data storage servers via a network. The computing device may be a data storage itself storing at least a portion of the data.

A method embodying the present invention may be carried out by a plurality of computing devices operating in cooperation with one another. One or more of the plurality of computing devices may be a data storage server storing at least a portion of the data.

The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The invention can be implemented as a computer program or computer program product, i.e., a computer program tangibly embodied in a non-transitory information carrier, e.g., in a machine-readable storage device, or in a propagated signal, for execution by, or to control the operation of, one or more hardware modules.

A computer program can be in the form of a stand-alone program, a computer program portion or more than one computer program and can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a data processing environment. A computer program can be deployed to be executed on one module or on multiple modules at one site or distributed across multiple sites and interconnected by a communication network.

Method steps of the invention can be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output. Apparatus of the invention can be implemented as programmed hardware or as special purpose logic circuitry, including e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions coupled to one or more memory devices for storing instructions and data.

The above-described embodiments of the present invention may advantageously be used independently of any other of the embodiments or in any feasible combination with one or more others of the embodiments.

Glossary of Terms Used in the Specification

ADAS (Advanced Driver-Assistance System): A combination of software and hardware included in an automobile that assists the driver without taking full control of the vehicle.

Feature attribution map (or ‘feature importance map’): A heatmap over an image that has been classified by a CNN (or other method) which indicates the importance of each pixel in that image with respect to the output classification or with respect to the output activation(s) of some other component(s) of the classifier.

Inpainting: A category of image-processing methods for automatically filling in missing image data with an estimation of the lost information, or replacing an entity in the image with an estimation of the background region it occludes.

Perturbation-based feature attribution: A method of generating a feature attribution map by perturbing the input image and observing the change in output classification or activation of the component of interest.

Presence Matrix: A term used to denote a table which represents what features or objects (columns) are present in which images/video frames (rows).

Problem dataset: The dataset on which the CNN to be explained was originally trained, i.e. a dataset representing the problem domain to which the CNN is to be applied.

Support dataset: A dataset used for the purpose of kernel labelling.

1. A computer-implemented image classification method comprising: obtaining a convolutional neural network, CNN, trained to classify features in images using a training image dataset; extracting a logic program from the CNN, the logic program being a symbolic approximation of outputs of kernels at an extraction layer of the CNN, and deriving from the logic program rules which use the kernels to explain the classification of images by the CNN; obtaining a feature-labeled image dataset, and a record of each feature associated with each feature-labeled image in the dataset, where the images in the dataset comprise pairs of feature-labeled images, one feature-labeled image of the pair being of a scene containing a feature and the other feature-labeled image of the pair being of the same scene without the feature; forward-propagating the pairs of feature-labeled images through the logic program to obtain kernel activations at the extraction layer for features in the images; and calculating a correlation between each kernel in the logic program and each feature in the feature-labeled images using the obtained kernel activations and the features associated with the feature-labeled images; assigning to each kernel in the logic program the label of the feature with which the kernel has the highest correlation; and applying the assigned kernel labels to the kernels in the derived rules to obtain kernel-labeled rules.
2. A method as claimed in claim 1, wherein images in the feature-labeled image dataset comprise still frames from at least one video recording.
3. A method as claimed in claim 2, wherein the at least one video recording was captured by a closed circuit television, CCTV, camera.
4. A method as claimed in claim 1, wherein the feature-labeled image dataset comprises images annotated for semantic segmentation, and the record of each feature associated with each feature-labeled image in the dataset comprises a value corresponding to a total area occupied by the feature in the image.
5. A computer-implemented method of improving training of an image classifier, the method comprising: for a convolutional neural network, CNN, trained to classify features in images, obtaining kernel-labeled rules which have been derived from the CNN using the method of claim 1; for at least one image not forming part of the training image dataset used to train the CNN or the feature-labeled image dataset used to derive the kernel-labeled rules, obtaining a classification of the at least one image determined by the CNN, which classification has been assessed as being incorrect, and identifying a rule of the kernel-labeled rules which is associated with the incorrect classification; and causing the CNN to be retrained using further training images containing features corresponding to the kernel labels of the rule associated with the incorrect classification.
6. A non-statutory computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of claim 1.
7. Image classification apparatus comprising: at least one memory to store: (a) a convolutional neural network, CNN, trained to classify features in images using a training image dataset, and (b) a feature-labeled image dataset and a record of each feature associated with each feature-labeled image in the dataset, where the images in the dataset comprise pairs of feature-labeled images, one feature-labeled image of the pair being of a scene containing a feature and the other feature-labeled image of the pair being of the same scene without the feature; and at least one processor, connected to the memory, to: extract a logic program from the CNN stored in the memory, the logic program being a symbolic approximation of outputs of kernels at an extraction layer of the CNN, and derive from the logic program rules which use the kernels to explain the classification of images by the CNN; forward-propagate the pairs of feature-labeled images from the feature-labeled dataset stored in the memory through the logic program to obtain kernel activations at the extraction layer for features in the images; and calculate a correlation between each kernel in the logic program and each feature in the feature-labeled images using the obtained kernel activations and the features associated with the feature-labeled images; assign to each kernel in the logic program the label of the feature with which the kernel has the highest correlation; and apply the assigned kernel labels to the kernels in the derived rules to obtain kernel-labeled rules.
8. Apparatus as claimed in claim 7, wherein images in the feature-labeled image dataset comprise still frames from at least one video recording.
9. Apparatus as claimed in claim 8, wherein the at least one video recording was captured by a closed circuit television, CCTV, camera.
10. Apparatus as claimed in claim 7, wherein the feature-labeled image dataset comprises images annotated for semantic segmentation, and the record of each feature associated with each feature-labeled image in the dataset comprises a value corresponding to a total area occupied by the feature in the image.
11. Apparatus to improve training of an image classifier, the apparatus comprising: at least one memory storing: for a convolutional neural network, CNN, trained to classify features in images, kernel-labeled rules which have been derived from the CNN using the apparatus of claim 7; and for at least one image not forming part of the training image dataset used to train the CNN or the feature-labeled image dataset used to derive the kernel-labeled rules, a classification of the at least one image determined by the CNN, which classification has been assessed as being incorrect; and at least one processor, connected to the memory, to: identify a rule of the kernel-labeled rules which is associated with the incorrect classification; and cause the CNN to be retrained using further training images containing features corresponding to the kernel labels of the rule associated with the incorrect classification.